Descripcion base de datos
Titulo de la base de datos:
Forest Covertype data, Remote Sensing and GIS Program Department of Forest Sciences College of Natural Resources Colorado State University.
El dataset consiste en medidas cartograficas de areas silvestres en el Bosque Nacional Roosevelt en el norte de Colorado en USA y el tipo de covertura vegetal presente. Se realizan en celdas de 30 X 30 metros. Estos bosques tienen poca intervencion humana, por ende las areas caracterizan no usos (como agricultura, poblacion etc) sino clase de bosque predominante. En total son 12 medidas cartograficas y 7 grandes tipos de covertura vegetal.
El dataset:
Numero de columnas:
length(df)
[1] 55
Numero de observaciones:
nrow(df)
[1] 581011
Atributos:
names(df[,c(1:10,55)])
[1] "Elevation" "Aspect"
[3] "Slope" "Horizontal_Distance_To_Hydrology"
[5] "Vertical_Distance_To_Hydrology" "Horizontal_Distance_To_Roadways"
[7] "Hillshade_9am" "Hillshade_Noon"
[9] "Hillshade_3pm" "Horizontal_Distance_To_Fire_Points"
[11] "Forest_Cover_Type"
Otros atributos son cuatro areas silvestres y 40 tipos de suelo.
El tipo de atributo, las unidades y su descripcion:
| Elevation |
quantitative |
meters |
Elevation in meters |
| Aspect |
quantitative |
azimuth |
Aspect in degrees azimuth |
| Slope |
quantitative |
degrees |
Slope in degrees |
| Horizontal_Distance_To_Hydrology |
quantitative |
meters |
Horz Dist to nearest surface water features |
| Vertical_Distance_To_Hydrology |
quantitative |
meters |
Vert Dist to nearest surface water features |
| Horizontal_Distance_To_Roadways |
quantitative |
meters |
Horz Dist to nearest roadway |
| Hillshade_9am |
quantitative |
0 to 255 index |
Hillshade index at 9am, summer solstice |
| Hillshade_Noon |
quantitative |
0 to 255 index |
Hillshade index at noon, summer soltice |
| Hillshade_3pm |
quantitative |
0 to 255 index |
Hillshade index at 3pm, summer solstice |
| Horizontal_Distance_To_Fire_Points |
quantitative |
meters |
Horz Dist to nearest wildfire ignition points |
| Wilderness_Area (4 binary columns) |
qualitative |
0 (absence) or 1 (presence) |
Wilderness area designation |
| Soil_Type (40 binary columns) |
qualitative |
0 (absence) or 1 (presence |
Soil Type designation |
| Cover_Type (7 types) |
integer |
1 to 7 |
Forest Cover Type designation |
Las cuatro areas silvestres:
- Rawah Wilderness Area
- Neota Wilderness Area
- Comanche Peak Wilderness Area
- Cache la Poudre Wilderness Area
Tipo de covertura forestal:
- Spruce/Fir (picea -pino-/abeto)
- Lodgepole Pine (pino)
- Ponderosa Pine (pino)
- Cottonwood/Willow (sauces)
- Aspen (alamos)
- Douglas-fir (pino oregon)
- Krummholz (vegetación atrofiada y deformada, arbol “bandera”)
Missings
No hay missings.
colSums(is.na(df))
Elevation Aspect
0 0
Slope Horizontal_Distance_To_Hydrology
0 0
Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways
0 0
Hillshade_9am Hillshade_Noon
0 0
Hillshade_3pm Horizontal_Distance_To_Fire_Points
0 0
Wilderness_Area_1 Wilderness_Area_2
0 0
Wilderness_Area_3 Wilderness_Area_4
0 0
Soil_Type1 Soil_Type2
0 0
Soil_Type3 Soil_Type4
0 0
Soil_Type5 Soil_Type6
0 0
Soil_Type7 Soil_Type8
0 0
Soil_Type9 Soil_Type10
0 0
Soil_Type11 Soil_Type12
0 0
Soil_Type13 Soil_Type14
0 0
Soil_Type15 Soil_Type16
0 0
Soil_Type17 Soil_Type18
0 0
Soil_Type19 Soil_Type20
0 0
Soil_Type21 Soil_Type22
0 0
Soil_Type23 Soil_Type24
0 0
Soil_Type25 Soil_Type26
0 0
Soil_Type27 Soil_Type28
0 0
Soil_Type29 Soil_Type30
0 0
Soil_Type31 Soil_Type32
0 0
Soil_Type33 Soil_Type34
0 0
Soil_Type35 Soil_Type36
0 0
Soil_Type37 Soil_Type38
0 0
Soil_Type39 Soil_Type40
0 0
Forest_Cover_Type
0
###Variables categoricas###
Una de las variables consiste en medidas del tipo de suelo presente. En el dataset esta variable se mapeo a 40 one-hot variables. Cada variable corresponde a un tipo de suelo, catalogado con un codigo que incluye informacion de la zona climatica y zona geologica. La distribucion de esta variable es la siguiente :
barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")
png('barsuelo.png')
barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")
dev.off()
png
2

El suelo mas abundante tiene el doble de frecuencia que el siguiente mas abundante y su frecuencia esta 4 ordenes de magnitud por encima del suelo menos frecuente. En vista de que son muchas variables individuales a ser consideradas y no pueden ser combinadas (no sin conocimiento experto en geologia) para reducir su numero. No sera considerada en el analisis.
Otra de las variables categoricas es la zona silvestre de la celda. La abundancia de zonas:
barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")
png('barareas.png')
barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")
dev.off()
png
2

Las areas silvestres 3 y 4 son alrededor de 8 veces mas abundantes en el dataset que las 1 y 2. Tambien se podria usar para clasificar pero se escoge la clase de cobertura forestal.
Descripcion de la variable objetivo, las clases de cobertura vegetal:
library(plotly)
labels = c('Spruce/Fir','Lodgepole Pine','Ponderosa Pine','Cottonwood/Willow','Aspen','Douglas-fir',
'Krummholz')
values <- aggregate(df$Elevation~ df$Forest_Cover_Type, data=df, FUN=length)
fig <- plot_ly(type='pie', labels=labels, values=values$`df$Elevation`,
textinfo='label+percent',
insidetextorientation='radial',showlegend = FALSE)
fig
#png('covertype.png')
#fig
#dev.off()
#pie(table(df$Forest_Cover_Type),main='Tipos de covertura')
#barplot(table(df$World.region), xlab = "Region", ylab = "Cantidad de Ciudades", main="Ciudades por region",cex.names = .3, las=2)
Se puede observar que los datos estan bastante desbalanceados. Esto incidiría en una regresión pero no en un LDA.
Resumen de los datos:
df_max<-apply(df[,1:10],2,max)
df_min<-apply(df[,1:10],2,min)
df_n <- sweep(df[,1:10], 2, df_min, "-")
range <- df_max-df_min
df_n <- sweep(df_n, 2, range, "/")
boxplot(df_n,las=2)

Todas las distribuciones son sesgadas y tienen muchos outliers. Para clasificar, si se va a aplicar un LDA, hay que garantizar normalidad y que las medias sean distintas, pero se vera mas adelante. Tambien es importante garantizar que los atributos sean independientes.
La variable Aspect es bimodal y probablemente Slope es multimodal. Todas las variables tienen colas largas, excepto Hillshade_3pm. En todas las variables excepto en Elevation las clases siguen la misma tendencia. En esta variable las distribuciones para cada grupo estan centradas distinto:
ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=4)
png('histograms.png')
ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=3)
dev.off()
png
2

Correlaciones
library(ggcorrplot)
corr <- round(cor(subset0[,1:10]), 1)
p.mat<- cor_pmat(subset0[,1:10])
ggcorrplot(corr, lab = TRUE,
outline.col = "white",lab_size = 3, tl.cex=5,method = "circle",hc.order = TRUE, p.mat = p.mat)
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
png('correlation.png')
ggcorrplot(corr, lab = TRUE,
outline.col = "white",lab_size = 3, tl.cex=5, method = "circle",hc.order = TRUE, p.mat = p.mat)
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
dev.off()
png
2

Las variables hillshade estan relacionadas con aspect, slope y entre ellas. Esto tiene que ver con como se calculan las cantidades hillshade que son los sombreados que se dibujan sobre los mapas cartograficos. Se supone una iluminacion simulada que depende de la orientacion a la fuente de luz, la cual esta basada en las variables aspect y slope. Las cantidades distancia vertical y horizontal a cursos de agua tambien estan relacionadas y se calculan con slope y elevation. La distancia horizontal a caminos y a puntos de fuego tambien se calculan con la elevacion y estan relacionados entre si.
Con base en esto acoto las variables a Elevation, Aspect, Slope, Hillshade_noon, Horizontal_Distance_Roadways y Horizontal_Distance_to_Hydrology.
library(GGally)
library(data.table)
color <- as.factor(subset[,7])
ggpairs(subset[,1:5], aes(color = color, alpha = 0.5),lower = list(combo = "count"),upper = "blank",)

Se pueden observar tambien las distribuciones de las clases, son sesgadas como se vio en los boxplots. La variable que podria servir mas para distinguir clases podria ser Elevation.
subsubset <- subset[,c(1:4,6,8,11)]
Analisis explotario con PCA
library(ggbiplot)
library(plyr)
datos_pca <- prcomp(subset[,1:5,7],scale=TRUE)
p <- ggbiplot(datos_pca, obs.scale=0.01,alpha = 0.3,groups=subset$cover_type)
p <- p + xlim(-3, 2) + ylim(-2, 3)
plot(p)
png('pca.png')
plot(p)
dev.off()
png
2

LDA
Habiendo determinado que la variable de clase es discriminante, calculamos el lda:
library(MASS)
modelo_lda <- lda(formula = cover_type ~ Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon, data = subset)
Ahora se prueba el modelo:
predicciones <- predict(object = modelo_lda, newdata = testeo, method = "predictive")
table(clase, predicciones$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 510 203 0 0 0 0 7
2 199 791 11 0 0 10 0
3 0 24 77 1 0 6 0
4 0 0 10 1 0 0 0
5 1 32 0 0 0 0 0
6 0 20 45 0 0 8 0
7 62 2 0 0 0 0 5
Los errores de prediccion:
training_error <- mean(clase != predicciones$class) * 100
training_error
[1] 31.25926
El error da bastante grande…
Con una variable mas:
library(klaR)
testeo$clase <- as.factor(testeo$clase)
partimat(clase ~ ., data = testeo, method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7))


Sin esa variable:
library(klaR)
testeo$clase <- as.factor(testeo$clase)
partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)
png('partiplot.png')
partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)
dev.off()
png
2

Discrimina muy mal. Ahora, si transformo las variables con un log:
library(IDPmisc)
df_trans <- log(subset[,1:6])
subset_trans <- cbind(df_trans,cover_type)
subset_t <- NaRV.omit(subsubset_trans)
subset_t$cover_type <- as.factor(subset_t$cover_type)
library(MASS)
modelo_lda_trans <- lda(formula = cover_type ~. , data = subset_t)
library(IDPmisc)
library(dplyr)
test_trans <- log(test)
test_t <- cbind(test_trans,clase)
test_t <- NaRV.omit(test_t)
clase_t <-test_t$clase
predicciones_t <- predict(object = modelo_lda_trans, newdata = test_t)
table(clase_t, predicciones_t$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 0 682 0 0 0 0 0
2 0 969 0 0 0 0 0
3 0 103 0 0 0 0 0
4 0 8 0 0 0 0 0
5 0 29 0 0 0 0 0
6 0 68 0 0 0 0 0
7 0 67 0 0 0 0 0
training_error_t <- mean(clase_t != predicciones_t$class) * 100
training_error_t
[1] 49.68847
Pesimo.
Si considero una variable mas, sin transformar:
library(MASS)
modelo_lda_2 <- lda(formula = cover_type ~. , data = subset_2)
predicciones_2 <- predict(object = modelo_lda_2, newdata = testeo_2)
table(testeo_2$clase, predicciones_2$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 516 200 0 0 0 0 4
2 201 790 11 0 0 9 0
3 0 25 69 4 0 10 0
4 0 0 10 1 0 0 0
5 0 33 0 0 0 0 0
6 0 19 44 0 0 10 0
7 63 1 0 0 0 0 5
training_error_2 <- mean(testeo_2$clase != predicciones_2$class) * 100
training_error_2
[1] 31.30864
Considerar las variables con logaritmo o considerar una variable adicional no mejoro la prediccion. Se intentara ahora con un lda robusto:
library(MASS)
modelo_qda <- qda(formula = cover_type ~ ., data = subset)
predicciones_qda <- predict(object = modelo_qda, newdata = testeo, method = "predictive")
table(clase, predicciones_qda$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 66 15 0 4 161 37 437
2 92 66 1 30 572 86 164
3 0 0 3 68 8 29 0
4 0 0 0 11 0 0 0
5 2 0 0 0 29 2 0
6 0 0 4 37 7 25 0
7 0 0 0 0 1 0 68
training_error_qda <- mean(clase != predicciones_qda$class) * 100
training_error_qda
[1] 86.76543
Da peor…
Con las clases escaladas:
datos_escalados <- as.data.frame(scale(subset[,1:5]))
subset_esc <- cbind(datos_escalados,cover_type)
subset_esc$cover_type <- as.factor(subset_esc$cover_type)
test de Shapiro para los datos escalados:
subset_esc$cover_type <- as.factor(subset_esc$cover_type)
library(mvnormtest)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='1',1:5]))
shapitest <- test$p.value
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='2',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='3',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='4',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='5',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='6',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='7',1:5]))
shapitest <- append(shapitest,test$p.value)
shapitest
[1] 5.239585e-15 2.098729e-18 1.165629e-01 5.931935e-04 1.401489e-09 1.025656e-02 1.953662e-10
library(MASS)
modelo_lda_esc <- lda(formula = cover_type ~., data = subset_esc)
test_esc <- as.data.frame(scale(test))
test_esc <- cbind(test_esc,clase)
test_esc$clase <- as.factor(test_esc$clase)
predicciones_esc <- predict(object = modelo_lda_esc, newdata = test_esc, method = "predictive")
table(clase, predicciones_esc$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 521 192 0 0 0 0 7
2 216 774 11 0 0 10 0
3 0 25 76 1 0 6 0
4 0 0 10 1 0 0 0
5 1 32 0 0 0 0 0
6 0 21 43 0 0 9 0
7 63 1 0 0 0 0 5
training_error_esc <- mean(clase != predicciones_esc$class) * 100
training_error_esc
[1] 31.55556
El mejor resultado fue el inicial. En general se predicen muy mal las clases menos abundantes.
El qda con los datos escalados:
library(MASS)
modelo_qda_esc <- qda(formula = cover_type ~ ., data = subset[,1:5,7])
predicciones_qda_esc <- predict(object = modelo_qda_esc, newdata = testeo[,1:5,7], method = "predictive")
table(clase, predicciones_qda_esc$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 115 21 0 2 168 10 404
2 129 99 0 15 552 74 142
3 0 0 6 63 8 31 0
4 0 0 0 10 0 1 0
5 2 0 0 0 29 2 0
6 0 0 0 39 7 27 0
7 0 0 0 0 1 0 68
training_errorqda_esc <- mean(clase != predicciones_qda_esc$class) * 100
training_errorqda_esc
[1] 82.51852
Se realizara el analisis con solo dos clases:
subsubset2 <-subset[(subset$cover_type==2),]
subsubset1 <-subset[(subset$cover_type==1),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)
El test de normalidad multivariada:
library(mvnormtest)
testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='1',1:6]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='2',1:6]))
shapitest <- append(shapitest,testing$p.value)
shapitest
[1] 2.929115e-30 3.038726e-36
Rechaza normalidad para cada variable para cada nivel de la variable.
Con las variables estandarizadas:
library(clusterSim)
#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
subsubset_es <- data.Normalization (subsubset[,1:6],type="n2",normalization="column")
subsubset_es <- cbind(subsubset_es,subsubset[,7])
Con las variables estandarizadas y sacando hillshade:
library(mvnormtest)
subsubset_2 <- subsubset_es[,c(1:5,7)]
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='1',1:5]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='2',1:5]))
shapitest <- append(shapitest,testing$p.value)
shapitest
[1] 5.239585e-15 2.098729e-18
Rechaza, sin sacar y sacando hillshade.
Testeando homocedasticidad:
library(biotools)
boxM(data=subsubset[,1:6],grouping=subsubset[,7])
Rechaza igualdad de varianzas.
Como el test M de Box es sensible a la falta de normalidad, realizamos el de Levene:
library(car)
Loading required package: carData
Attaching package: ‘car’
The following object is masked from ‘package:dplyr’:
recode
leveneTest( Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon ~ subsubset$cover_type, data = subsubset)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 24.874 6.365e-07 ***
4268
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
El valor es menor que el nivel de significancia 0.001. Rechazamos la hipotesis nula y concluimos que las varianzas no son iguales.
De todas maneras hacemos el test de Hotelling para testear diferencia de medias:
library(Hotelling)
fitProd = hotelling.test(.~ subsubset$cover_type, data = subsubset)
fitProd
Rechaza igualdad de medias, pero como no se cumple la normalidad multivariada este resultado no es confiable.
LDA con dos clases
el conjunto de test para dos clases:
library(dplyr)
testeo2 <-testeo[(testeo$clase==2),]
testeo1 <-testeo[(testeo$clase==1),]
testeo_2class <-rbind(testeo1,testeo2)
testeo_2class$clase <- droplevels(testeo_2class$clase)
clase_2class <- testeo_2class[,7]
test_2class <-testeo_2class[,1:6]
library(MASS)
modelo_lda_2class <- lda(formula = cover_type ~. , data = subsubset)
la clasificacion ingenua:
predicciones_2class_0 <- predict(object = modelo_lda_2class, newdata = subsubset)
table(subsubset$cover_type, predicciones_2class_0$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2
1 1343 488
2 525 1914
training_error_2cl_0 <- mean(subsubset$cover_type != predicciones_2class_0$class) * 100
training_error_2cl_0
[1] 23.72365
predicciones_2class <- predict(object = modelo_lda_2class, newdata = testeo_2class)
table(clase_2class, predicciones_2class$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2
1 516 204
2 199 812
training_error_2cl <- mean(clase_2class != predicciones_2class$class) * 100
training_error_2cl
[1] 23.28134
library(klaR)
partimat(subsubset$cover_type ~ ., data = subsubset, method = "lda", col.correct=NA, col.wrong=NA, plot.matrix=FALSE,image.colors =rainbow(2))


La variable que mejor separa es elevacion.
Con las variables estandarizadas:
library(MASS)
modelo_lda_2class_est <- lda(formula = cover_type ~. , data = subsubset_es)
la clasificacion ingenua:
predicciones_2class_es_0 <- predict(object = modelo_lda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, predicciones_2class_es_0$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2
1 1343 488
2 525 1914
training_error_2cl_es_0 <- mean(subsubset_es$cover_type != predicciones_2class_es_0$class) * 100
training_error_2cl_es_0
[1] 23.72365
library(clusterSim)
#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
testeo_es <- data.Normalization (testeo_2class[,1:6],type="n2",normalization="column")
testeo_es <- cbind(testeo_es,testeo_2class$clase)
testeo_es <- rename(testeo_es, cover_type=`testeo_2class$clase`)
predicciones_2class_es <- predict(object = modelo_lda_2class_est, newdata = testeo_es)
table(testeo_es$clase, predicciones_2class_es$class, dnn = c("Clase real", "Clase predicha"))
training_error_2cl_es <- mean(testeo_es$clase != predicciones_2class_es$class) * 100
training_error_2cl_es
[1] 23.62796
Haciendo el discriminante robusto:
library(MASS)
modelo_qda_2class_est <- qda(formula = cover_type ~. , data = subsubset_es)
prediccionesqda_2c_es_0 <- predict(object = modelo_qda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, prediccionesqda_2c_es_0$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2
1 1397 434
2 613 1826
training_errorqda_2cl_es_0 <- mean(subsubset_es$cover_type != prediccionesqda_2c_es_0$class) * 100
training_errorqda_2cl_es_0
[1] 24.51991
prediccionesqda_2c_es <- predict(object = modelo_qda_2class_est, newdata = testeo_es)
table(testeo_es$clase, prediccionesqda_2c_es$class, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2
1 559 161
2 252 759
training_errorqda_2cl_es <- mean(testeo_es$clase != prediccionesqda_2c_es$class) * 100
training_errorqda_2cl_es
[1] 23.85904
Incluso empeora un poco.
Support Vector machine
Clasificando con svm el dataset con las 7 clases:
library(e1071)
model.svm = svm( subset$cover_type ~ ., data = subset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)
Call:
svm(formula = subset$cover_type ~ ., data = subset[, c(1:5, 7)], kernel = "radial", cost = 10,
scale = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
Number of Support Vectors: 3264
predicciones_svm = predict(model.svm, subset[,c(1:5,7)])
table(subset$cover_type, predicciones_svm, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 1342 465 0 0 0 0 24
2 373 2035 24 0 0 4 3
3 0 60 225 0 0 9 0
4 0 0 7 13 0 1 0
5 0 60 0 0 18 0 0
6 0 66 46 0 0 35 0
7 86 4 0 0 0 0 100
training_error_svm_0 <- mean(subset$cover_type != predicciones_svm) * 100
training_error_svm_0
[1] 24.64
El kernel que mejor da es el radial.
predicciones_svm_te = predict(model.svm, testeo)
table(testeo$clase, predicciones_svm_te, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2 3 4 5 6 7
1 499 209 0 0 0 0 12
2 169 821 13 0 3 4 1
3 0 26 76 1 0 5 0
4 0 0 8 3 0 0 0
5 0 29 0 0 4 0 0
6 0 27 38 0 0 8 0
7 37 3 0 0 0 0 29
training_error_svm <- mean(testeo$clase != predicciones_svm_te) * 100
training_error_svm
[1] 28.88889
Si lo hacemos con menos variables:
subsubset1 <-subset[(subset$cover_type==1),]
subsubset2 <-subset[(subset$cover_type==2),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)
library(e1071)
model.svm_2 = svm( subsubset$cover_type ~ ., data = subsubset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)
Call:
svm(formula = subset$cover_type ~ ., data = subset[, c(1:5, 7)], kernel = "radial", cost = 10,
scale = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 10
Number of Support Vectors: 3264
predicciones_svm_2 = predict(model.svm_2, subsubset[,c(1:5,7)])
table(subsubset$cover_type, predicciones_svm_2, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2
1 1359 472
2 374 2065
training_error_svm_2 <- mean(subsubset$cover_type != predicciones_svm_2) * 100
training_error_svm_2
[1] 19.81265
Ahora con el test:
predicciones_svm_te_2 = predict(model.svm_2, testeo_2class)
table(testeo_2class$clase, predicciones_svm_te_2, dnn = c("Clase real", "Clase predicha"))
Clase predicha
Clase real 1 2
1 501 219
2 170 841
training_error_svm_2_te <- mean(testeo_2class$clase != predicciones_svm_te_2) * 100
training_error_svm_2_te
[1] 22.47256
Graficando para todas las clases:
plot(model.svm, data=subset[,c(1:5,7)],Elevation ~ Aspect, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm, data=subset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)],Elevation ~ Aspect, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

plot(model.svm_2, data=subsubset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

Se dibujan mas fronteras cuando la variable Elevation esta presente.
Clasificacion no supervisada
Para la clasificacion no supervisada, comenzamos con metodos jerarquicos porque k-means es sensible a outliers:
datos_clust <- subset
mat_dist <- dist(x = subset, method = "euclidean")
Dendrogramas para distintos metodos, suponemos 7 clusters:
hc_complete <- hclust(d = mat_dist, method = "complete")
hc_average <- hclust(d = mat_dist, method = "average")
hc_single <- hclust(d = mat_dist, method = "single")
hc_ward <- hclust(d = mat_dist, method = "ward.D2")
hc_cn <- hclust(d = mat_dist, method = "centroid")
Construyendo los dendrogramas:
cantidad_clusters = 7
plot(hc_complete)
rect.hclust(hc_complete, k=cantidad_clusters, border="red") #

jer_complete<-cutree(hc_complete,k=cantidad_clusters) #
datos_clust$jer_complete=jer_complete
cantidad_clusters = 7
plot(hc_average)
rect.hclust(hc_average, k=cantidad_clusters, border="red") #

jer_average<-cutree(hc_average,k=cantidad_clusters) #
datos_clust$jer_average=jer_average
cantidad_clusters = 7
plot(hc_single)
rect.hclust(hc_single, k=cantidad_clusters, border="red") #

jer_single<-cutree(hc_single,k=cantidad_clusters) #
#datos$jer_complete=jer_complete
Da feisimo…
cantidad_clusters = 7
plot(hc_ward)
rect.hclust(hc_ward, k=cantidad_clusters, border="red") #

jer_ward<-cutree(hc_ward,k=cantidad_clusters) #
datos_clust$jer_ward=jer_ward
cantidad_clusters = 7
plot(hc_cn)
rect.hclust(hc_cn, k=cantidad_clusters, border="red") #

jer_cn<-cutree(hc_cn,k=cantidad_clusters) #
datos_clust$jer_cn=jer_cn
Los coeficientes de correlacion cofenetica:
cor(x = mat_dist, cophenetic(hc_complete))
[1] 0.7768551
cor(x = mat_dist, cophenetic(hc_average))
[1] 0.7736942
cor(x = mat_dist, cophenetic(hc_single))
[1] 0.1769797
cor(x = mat_dist, cophenetic(hc_ward))
[1] 0.6320018
cor(x = mat_dist, cophenetic(hc_cn))
[1] 0.7774966
El linkage single es el peor.
datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)
p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=jer_cn)) +
geom_point() + scale_color_brewer(palette="Dark2")
p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn)) +
geom_point()
p2 + scale_color_brewer(palette="Dark2")
p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
geom_point()
p3 + scale_color_brewer(palette="Dark2")
p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
geom_point()
p4 + scale_color_brewer(palette="Dark2")
library(ggplot2)
library(easyGgplot2)
datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)
p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn, alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")
p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')
p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')
p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')
p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')
p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')
ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)
png('jerarquical.png')
ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)
dev.off()
png
2

library(ggplot2)
p <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=jer_cn)) +
geom_point()
p + scale_color_brewer(palette="Dark2")
p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
geom_point()
p + scale_color_brewer(palette="Dark2")
p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
geom_point()
p + scale_color_brewer(palette="Dark2")
p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
geom_point()
p + scale_color_brewer(palette="Dark2")
library(ggplot2)
p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
geom_point()
p + scale_color_brewer(palette="Dark2")
p <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_complete)) +
geom_point()
p + scale_color_brewer(palette="Dark2")
Solo lo hice con el primer metodo de cluster jerarquico que fue el que mejor dio el coeficiente cofrenetico.
library(cluster)
datos_kmeans = datos_clust[1:5]
cantidad_clusters=7
CL = kmeans(scale(datos_kmeans),cantidad_clusters)
datos_kmeans$kmeans = CL$cluster
library(ggplot2)
library(easyGgplot2)
datos_kmeans$kmeans <- as.factor(datos_kmeans$kmeans)
p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=datos_kmeans$kmeans, alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")
p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")
p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')
p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')
p5 <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")
p6 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans, alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none") +labs(y='H_Hydrology')
p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')
p8 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')
p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')
p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')
ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)
png('jerarquical_2.png')
ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)
dev.off()
png
2

---
title: "Trabajo Práctico AID"
output: html_notebook
---
```{r include = FALSE}
knitr::opts_chunk$set(echo=FALSE)
```

# Descripcion base de datos

### Titulo de la base de datos:

Forest Covertype data, Remote Sensing and GIS Program Department of Forest Sciences College of Natural Resources Colorado State University.

El dataset consiste en medidas cartograficas de areas silvestres en el Bosque Nacional Roosevelt en el 
norte de Colorado en USA y el tipo de covertura vegetal presente. Se realizan en celdas  de 30 X 30 metros. 
Estos bosques tienen poca intervencion 
humana, por ende las areas caracterizan no usos (como agricultura, poblacion etc) sino clase de bosque 
predominante. En total son 12 medidas cartograficas y 7 grandes tipos de covertura vegetal.

El dataset:
	
Numero de columnas:
```{r}
length(df)
```

Numero de observaciones:  
```{r}

nrow(df)
```
### Atributos:

```{r}
names(df[,c(1:10,55)])
```
Otros atributos son cuatro areas silvestres y 40 tipos de suelo.

El tipo de atributo, las unidades y su descripcion:

Name                              |   Data Type       |      Measurement        |            Description
----------------------------------|-------------------|-------------------------|----------------------------------------
Elevation                         |   quantitative    |  meters                 |  Elevation in meters
Aspect                            |   quantitative    |  azimuth                |  Aspect in degrees azimuth
Slope                             |   quantitative    |  degrees                |  Slope in degrees
Horizontal_Distance_To_Hydrology  |   quantitative    |  meters                 |  Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology    |   quantitative    |  meters                 |  Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways   |   quantitative    |  meters                 |  Horz Dist to nearest roadway
Hillshade_9am                     |   quantitative    |  0 to 255 index         |  Hillshade index at 9am, summer solstice
Hillshade_Noon                    |   quantitative    |  0 to 255 index         |  Hillshade index at noon, summer soltice
Hillshade_3pm                     |   quantitative    |  0 to 255 index         |  Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points|   quantitative    |  meters                 |  Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns)|   qualitative     |  0 (absence) or 1 (presence)|  Wilderness area designation
Soil_Type (40 binary columns)     |   qualitative     |  0 (absence) or 1 (presence |  Soil Type designation
Cover_Type (7 types)              |   integer         |  1 to 7                     |  Forest Cover Type designation

Las cuatro areas silvestres:  	

* Rawah Wilderness Area
* Neota Wilderness Area
* Comanche Peak Wilderness Area
* Cache la Poudre Wilderness Area


Tipo de covertura forestal:	

* Spruce/Fir (picea -pino-/abeto)
* Lodgepole Pine (pino)
* Ponderosa Pine (pino)
* Cottonwood/Willow (sauces)
* Aspen (alamos)
* Douglas-fir (pino oregon)
* Krummholz (vegetación atrofiada y deformada, arbol "bandera")

### Missings

No hay missings.

```{r}
colSums(is.na(df))
```
###Variables categoricas###

Una de las variables consiste en medidas del tipo de suelo presente. En el dataset esta variable se mapeo a 40 one-hot variables. Cada variable corresponde a un tipo de suelo, catalogado con un codigo que incluye informacion de la zona climatica y zona geologica. La distribucion de esta variable es la siguiente :

```{r}
library(dplyr)

df_suelos <- df[,15:54]

for (j in 1:length(df_suelos)){
    for (i in 1:nrow(df_suelos)){
        if (df_suelos[i,j]==1){
            df_suelos[i,41]<-paste("Soil_Type",j,sep = "")
        }
    }
}
df_suelos <- rename(df_suelos, suelo=V41)

barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")
png('barsuelo.png')
barplot(sort(table(df_suelos$suelo)),las=2, log="y", xlab = "suelo", col = "#1b98e0")

dev.off()

```

El suelo mas abundante tiene el doble de frecuencia que el siguiente mas abundante y su frecuencia esta 4 ordenes de magnitud por encima del suelo menos frecuente. En vista de que son muchas variables individuales a ser consideradas y no pueden ser combinadas (no sin conocimiento experto en geologia) para reducir su numero. No sera considerada en el analisis.

Otra de las variables categoricas es la zona silvestre de la celda. La abundancia de zonas:

```{r}
library(dplyr)

df_areas <- df[,11:14]

for (j in 1:length(df_areas)){
    for (i in 1:nrow(df_areas)){
        if (df_areas[i,j]==1){
            df_areas[i,5]<-paste("Area",j,sep = "")
        }
    }
}
df_areas<- rename(df_areas, areas=V5)

barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")

png('barareas.png')

barplot(sort(table(df_areas$areas)),las=2, xlab = "areas silvestres", col = "#F4A582")

dev.off()
```
Las areas silvestres 3 y 4 son alrededor de 8 veces mas abundantes en el dataset que las 1 y 2. Tambien se podria usar para clasificar pero se escoge la clase de cobertura forestal.

Descripcion de la variable objetivo, las clases de cobertura vegetal:

```{r message=FALSE}
library(plotly)

labels = c('Spruce/Fir','Lodgepole Pine','Ponderosa Pine','Cottonwood/Willow','Aspen','Douglas-fir',
          'Krummholz')
values <- aggregate(df$Elevation~ df$Forest_Cover_Type, data=df, FUN=length)
fig <- plot_ly(type='pie', labels=labels, values=values$`df$Elevation`, 
               textinfo='label+percent',
               insidetextorientation='radial',showlegend = FALSE)
fig

#png('covertype.png')

#fig

#dev.off()

#pie(table(df$Forest_Cover_Type),main='Tipos de covertura')
#barplot(table(df$World.region), xlab = "Region", ylab = "Cantidad de Ciudades",         main="Ciudades por region",cex.names = .3, las=2) 
```

Se puede observar que los datos estan bastante desbalanceados. Esto incidiría en una regresión pero no en un LDA.

Resumen de los datos:

```{r}
df_max<-apply(df[,1:10],2,max)
df_min<-apply(df[,1:10],2,min)
df_n <- sweep(df[,1:10], 2, df_min, "-")
range <- df_max-df_min
df_n <- sweep(df_n, 2, range, "/")
boxplot(df_n,las=2)
```

Todas las distribuciones son sesgadas y tienen muchos outliers. Para clasificar, si se va a aplicar un LDA, hay que garantizar normalidad y que las medias sean distintas, pero se vera mas adelante. Tambien es importante garantizar que los atributos sean independientes.

La variable Aspect es bimodal y probablemente Slope es multimodal. Todas las variables tienen colas largas, excepto Hillshade_3pm. En todas las variables excepto en Elevation las clases siguen la misma tendencia. En esta variable las distribuciones para cada grupo estan centradas distinto:


```{r message=FALSE}
library(ggplot2)
library(easyGgplot2)


set.seed(42)

rows <- sample(nrow(df))
df_rand <- df[rows, ]
df_new <- df_rand[seq(1,500000,100),c(1:10)]
cover_type <- df_rand[seq(1,500000,100),55]
subset0 <- cbind(df_new,cover_type)
subset0$cover_type <- as.factor(subset0$cover_type)

# Histograma Elevation
plot1 <- ggplot2.histogram(data=subset0, xName='Elevation', groupName='cover_type', maintitle = "Histograma Elevation", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Elevation", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma Aspect
plot2 <- ggplot2.histogram(data=subset0, xName='Aspect',groupName='cover_type', maintitle = "Histograma Aspect", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Aspect", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma Slope
plot3 <- ggplot2.histogram(data=subset0, xName='Slope', groupName='cover_type',maintitle = "Histograma Slope", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Slope", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma h hydro
plot4 <-ggplot2.histogram(data=subset0, xName='Horizontal_Distance_To_Hydrology', groupName='cover_type',maintitle = "Histograma Horiz. distance to hydrology", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "H_hydrology", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma v hydro
plot5 <-ggplot2.histogram(data=subset0, xName='Vertical_Distance_To_Hydrology', groupName='cover_type',maintitle = "Histograma Vert. distance to hydrology", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "V_hydrology", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma h road
plot6 <-ggplot2.histogram(data=subset0, xName='Horizontal_Distance_To_Roadways', groupName='cover_type',maintitle = "Histograma Horiz. distance to roadways", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "H_roadways", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma hill 9
plot7 <-ggplot2.histogram(data=subset0, xName='Hillshade_9am', groupName='cover_type',maintitle = "Histograma Hillshade_9am", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Hillshade_9am", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma hill 12
plot8 <-ggplot2.histogram(data=subset0, xName='Hillshade_Noon', groupName='cover_type',maintitle = "Histograma Hillshade_noon", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Hillshade_noon", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma hill 3
plot9 <-ggplot2.histogram(data=subset0, xName='Hillshade_3pm', groupName='cover_type',maintitle = "Histograma Hillshade_3pm", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "Hillshade_3pm", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
# Histograma h firepoint
plot10 <-ggplot2.histogram(data=subset0, xName='Horizontal_Distance_To_Fire_Points', groupName='cover_type', maintitle = "Histograma Horiz. distance to firepoints", xtitleFont=c(10,"bold", "black"), ytitleFont=c(10,"bold", "black"), xTickLabelFont=c(7, "bold", "black"), yTickLabelFont=c(7, "bold", "black"), xtitle = "H_firepoints", ytitle = "Frecuencia", backgroundColor="white", brewerPalette="Paired",alpha=0.5,showLegend=FALSE,position="stack")
#

ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=4)

png('histograms.png')

ggplot2.multiplot(plot1,plot2,plot3,plot4,plot5,plot6,plot7,plot8,plot9,plot10, cols=3)

dev.off()
```

### Correlaciones


```{r}
library(ggcorrplot)

corr <- round(cor(subset0[,1:10]), 1)
p.mat<- cor_pmat(subset0[,1:10])

ggcorrplot(corr, lab = TRUE,
     outline.col = "white",lab_size = 3, tl.cex=5,method = "circle",hc.order = TRUE, p.mat = p.mat)

png('correlation.png')

ggcorrplot(corr, lab = TRUE,
     outline.col = "white",lab_size = 3, tl.cex=5, method = "circle",hc.order = TRUE, p.mat = p.mat)

dev.off()

```

Las variables hillshade estan relacionadas con aspect, slope y entre ellas. Esto tiene que ver con como se calculan las cantidades hillshade que son los sombreados que se dibujan sobre los mapas cartograficos. Se supone una iluminacion simulada que depende de la orientacion a la fuente de luz, la cual esta basada en las variables aspect y slope. Las cantidades distancia vertical y horizontal a cursos de agua tambien estan relacionadas y se calculan con slope y elevation. La distancia horizontal a caminos y a puntos de fuego tambien se calculan con la elevacion y estan relacionados entre si.

Con base en esto acoto las variables a Elevation, Aspect, Slope, Hillshade_noon, Horizontal_Distance_Roadways y Horizontal_Distance_to_Hydrology.


```{r message=FALSE, warning=FALSE, paged.print=FALSE}
library(GGally)
library(data.table)

color <- as.factor(subset[,7])
ggpairs(subset[,1:5], aes(color = color, alpha = 0.5),lower = list(combo = "count"),upper = "blank",)
```

Se pueden observar tambien las distribuciones de las clases, son sesgadas como se vio en los boxplots. La variable que podria servir mas para distinguir clases podria ser Elevation.


```{r}
subsubset <- subset[,c(1:4,6,8,11)]
```


### Analisis explotario con PCA
```{r message=FALSE}
library(ggbiplot)
library(plyr)

datos_pca <- prcomp(subset[,1:5,7],scale=TRUE)
p <- ggbiplot(datos_pca, obs.scale=0.01,alpha = 0.3,groups=subset$cover_type)
p <- p + xlim(-3, 2) + ylim(-2, 3)

plot(p)

png('pca.png')

plot(p)

dev.off()

```


# Clasificacion Supervisada

## Linear Discriminant Analysis

El primer metodo a implementar es LDA. Para que tenga validez el metodo deben cumplirse los supuestos de normalidad multivariada para cada nivel de cada variable y de homocedasticidad.

Para el test de normalidad multivariada usamos test de Shapiro:

```{r}
library(mvnormtest)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='1',1:6]))
shapitest <- test$p.value
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='2',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='3',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='4',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='5',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='6',1:6]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subsubset[subsubset$cover_type=='7',1:6]))
shapitest <- append(shapitest,test$p.value)

shapitest
```
Rechaza normalidad para todos los niveles de todas las variables. Para el total:

```{r}
library(mvnormtest)
mshapiro.test(t(subset[,1:6]))
```

Si el nivel de significacion por defecto es de 0.05, rechaza la normalidad multivariada. 

El siguiente supuesto de homocedasticidad lo evaluamos con el test M de Box:

```{r}
library(biotools)

boxM(data=subsubset[,1:6],grouping=subsubset[,7])
```
Rechaza la homogeneidad de varianzas. Se tendria que realizar un LDA robusto (cuadratico).

Igual, a pesar de haber rechazado la normalidad hacemos el test de Hotelling para determinar si las medias de los grupos son distintas:

```{r}
library(Hotelling)

fitProd = hotelling.test(.~ cover_type, data = subsubset) 
fitProd
```
Por ende rechazamos que las medias sean iguales. El cover_type es discriminante.

# LDA

Habiendo determinado que la variable de clase es discriminante, calculamos el lda:

```{r}

library(MASS)

modelo_lda <- lda(formula = cover_type ~ Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon, data = subset)

```

Ahora se prueba el modelo:

```{r}
predicciones <-  predict(object = modelo_lda, newdata = testeo, method = "predictive")
table(clase, predicciones$class, dnn = c("Clase real", "Clase predicha"))
```
Los errores de prediccion:

```{r}
training_error <- mean(clase != predicciones$class) * 100 
training_error 
```
El error da bastante grande...

Con una variable mas:

```{r}
library(klaR) 

testeo$clase <- as.factor(testeo$clase)

partimat(clase ~ ., data = testeo, method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),)
```
Sin esa variable:

```{r}
library(klaR) 

testeo$clase <- as.factor(testeo$clase)

partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)

png('partiplot.png')

partimat(clase ~ ., data = testeo[,c(1:5,7)], method = "lda", col.correct=NA, col.wrong=NA,plot.matrix=FALSE,image.colors =rainbow(7),scaled=TRUE)

dev.off()

```

Discrimina muy mal. Ahora, si transformo las variables con un log:

```{r}
library(IDPmisc)

df_trans <- log(subset[,1:6])
subset_trans <- cbind(df_trans,cover_type)
subset_t <- NaRV.omit(subsubset_trans)
subset_t$cover_type <- as.factor(subset_t$cover_type)
```


```{r}
library(MASS)

modelo_lda_trans <- lda(formula = cover_type ~. , data = subset_t)

```

```{r}
library(IDPmisc)
library(dplyr)

test_trans <- log(test)
test_t <- cbind(test_trans,clase)
test_t <- NaRV.omit(test_t)
clase_t <-test_t$clase
```


```{r}
predicciones_t <-  predict(object = modelo_lda_trans, newdata = test_t)
table(clase_t, predicciones_t$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_t <- mean(clase_t != predicciones_t$class) * 100 
training_error_t 
```
Pesimo.

Si considero una variable mas, sin transformar:


```{r}

library(MASS)

modelo_lda_2 <- lda(formula = cover_type ~. , data = subset_2)

```
```{r}
predicciones_2 <-  predict(object = modelo_lda_2, newdata = testeo_2)
table(clase, predicciones_2$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2 <- mean(testeo_2$clase != predicciones_2$class) * 100 
training_error_2 
```
Considerar las variables con logaritmo o considerar una variable adicional no mejoro la prediccion. Se intentara ahora con un lda robusto:



```{r}

library(MASS)

modelo_qda <- qda(formula = cover_type ~ ., data = subset)

```

```{r}
predicciones_qda <-  predict(object = modelo_qda, newdata = testeo, method = "predictive")
table(clase, predicciones_qda$class, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_qda <- mean(clase != predicciones_qda$class) * 100 
training_error_qda
```
Da peor...

Con las clases escaladas:

```{r}
datos_escalados <- as.data.frame(scale(subset[,1:5]))
subset_esc <- cbind(datos_escalados,cover_type)
subset_esc$cover_type <- as.factor(subset_esc$cover_type)
```

test de Shapiro para los datos escalados:

```{r}
subset_esc$cover_type <- as.factor(subset_esc$cover_type)
```

```{r}
library(mvnormtest)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='1',1:5]))
shapitest <- test$p.value
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='2',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='3',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='4',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='5',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='6',1:5]))
shapitest <- append(shapitest,test$p.value)
test <- mshapiro.test(t(subset_esc[subset_esc$cover_type=='7',1:5]))
shapitest <- append(shapitest,test$p.value)

shapitest
```

```{r}

library(MASS)

modelo_lda_esc <- lda(formula = cover_type ~., data = subset_esc)

```

```{r}
test_esc <- as.data.frame(scale(test))
test_esc <- cbind(test_esc,clase)
test_esc$clase <- as.factor(test_esc$clase)
```

```{r}
predicciones_esc <-  predict(object = modelo_lda_esc, newdata = test_esc, method = "predictive")
table(clase, predicciones_esc$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_esc <- mean(clase != predicciones_esc$class) * 100 
training_error_esc
```
El mejor resultado fue el inicial. En general se predicen muy mal las clases menos abundantes. 

El qda con los datos escalados:

```{r}

library(MASS)

modelo_qda_esc <- qda(formula = cover_type ~ ., data = subset[,1:5,7])

```

```{r}
predicciones_qda_esc <-  predict(object = modelo_qda_esc, newdata = testeo[,1:5,7], method = "predictive")
table(clase, predicciones_qda_esc$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_errorqda_esc <- mean(clase != predicciones_qda_esc$class) * 100 
training_errorqda_esc
```

Se realizara el analisis con solo dos clases:


```{r}
subsubset2 <-subset[(subset$cover_type==2),]
subsubset1 <-subset[(subset$cover_type==1),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)
```

El test de normalidad multivariada:

```{r}
library(mvnormtest)

testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='1',1:6]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset[subsubset$cover_type=='2',1:6]))
shapitest <- append(shapitest,testing$p.value)

shapitest
```
Rechaza normalidad para cada variable para cada nivel de la variable.

Con las variables estandarizadas:

```{r}
library(clusterSim)

#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
subsubset_es <- data.Normalization (subsubset[,1:6],type="n2",normalization="column")
subsubset_es <- cbind(subsubset_es,subsubset[,7])
subsubset_es <- rename(subsubset_es, cover_type=`subsubset[, 7]`)
```


Con las variables estandarizadas y sacando hillshade:

```{r}
library(mvnormtest)

subsubset_2 <- subsubset_es[,c(1:5,7)]
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='1',1:5]))
shapitest <- testing$p.value
testing <- mshapiro.test(t(subsubset_2[subsubset_2$cover_type=='2',1:5]))
shapitest <- append(shapitest,testing$p.value)

shapitest
```
Rechaza, sin sacar y sacando hillshade.

Testeando homocedasticidad:

```{r}
library(biotools)

boxM(data=subsubset[,1:6],grouping=subsubset[,7])
```
Rechaza igualdad de varianzas.

Como el test M de Box es sensible a la falta de normalidad, realizamos el de Levene:

```{r}
library(car)

leveneTest( Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology + Horizontal_Distance_To_Roadways + Hillshade_Noon ~ subsubset$cover_type, data = subsubset)
```

El valor es menor que el nivel de significancia 0.001. Rechazamos la hipotesis nula y concluimos que las varianzas no son iguales.

De todas maneras hacemos el test de Hotelling para testear diferencia de medias:

```{r}
library(Hotelling)

fitProd = hotelling.test(.~ subsubset$cover_type, data = subsubset) 
fitProd
```
Rechaza igualdad de medias, pero como no se cumple la normalidad multivariada este resultado no es confiable.

# LDA con dos clases

el conjunto de test para dos clases:

```{r}
library(dplyr)

testeo2 <-testeo[(testeo$clase==2),]
testeo1 <-testeo[(testeo$clase==1),]
testeo_2class <-rbind(testeo1,testeo2)
testeo_2class$clase <- droplevels(testeo_2class$clase)
clase_2class <- testeo_2class[,7]
test_2class <-testeo_2class[,1:6]

```


```{r}

library(MASS)

modelo_lda_2class <- lda(formula = cover_type ~. , data = subsubset)

```

la clasificacion ingenua:

```{r}
predicciones_2class_0 <-  predict(object = modelo_lda_2class, newdata = subsubset)
table(subsubset$cover_type, predicciones_2class_0$class, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_2cl_0 <- mean(subsubset$cover_type != predicciones_2class_0$class) * 100 
training_error_2cl_0
```

```{r}
predicciones_2class <-  predict(object = modelo_lda_2class, newdata = testeo_2class)
table(clase_2class, predicciones_2class$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2cl <- mean(clase_2class != predicciones_2class$class) * 100 
training_error_2cl
```

```{r}
library(klaR) 

partimat(subsubset$cover_type ~ ., data = subsubset, method = "lda", col.correct=NA, col.wrong=NA, plot.matrix=FALSE,image.colors =rainbow(2))
```
La variable que mejor separa es elevacion.


Con las variables estandarizadas:

```{r}

library(MASS)

modelo_lda_2class_est <- lda(formula = cover_type ~. , data = subsubset_es)

```

la clasificacion ingenua:

```{r}
predicciones_2class_es_0 <-  predict(object = modelo_lda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, predicciones_2class_es_0$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2cl_es_0 <- mean(subsubset_es$cover_type != predicciones_2class_es_0$class) * 100 
training_error_2cl_es_0

```
```{r}
library(clusterSim)

#subsubset_es <- scale(subsubset[,1:6], center = median(subsubset))
testeo_es <- data.Normalization (testeo_2class[,1:6],type="n2",normalization="column")
testeo_es <- cbind(testeo_es,testeo_2class$clase)
testeo_es <- rename(testeo_es, clase=`testeo_2class$clase`)
```


```{r}
predicciones_2class_es <-  predict(object = modelo_lda_2class_est, newdata = testeo_es)
table(testeo_es$clase, predicciones_2class_es$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_2cl_es <- mean(testeo_es$clase != predicciones_2class_es$class) * 100 
training_error_2cl_es

```
Haciendo el discriminante robusto:

```{r}

library(MASS)

modelo_qda_2class_est <- qda(formula = cover_type ~. , data = subsubset_es)

```

```{r}
prediccionesqda_2c_es_0 <-  predict(object = modelo_qda_2class_est, newdata = subsubset_es)
table(subsubset_es$cover_type, prediccionesqda_2c_es_0$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_errorqda_2cl_es_0 <- mean(subsubset_es$cover_type != prediccionesqda_2c_es_0$class) * 100 
training_errorqda_2cl_es_0

```
```{r}
prediccionesqda_2c_es <-  predict(object = modelo_qda_2class_est, newdata = testeo_es)
table(testeo_es$clase, prediccionesqda_2c_es$class, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_errorqda_2cl_es <- mean(testeo_es$clase != prediccionesqda_2c_es$class) * 100 
training_errorqda_2cl_es

```
Incluso empeora un poco.

## Support Vector machine

Clasificando con svm el dataset con las 7 clases:

```{r}
library(e1071)

model.svm = svm( subset$cover_type ~ ., data = subset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)

```
```{r}
predicciones_svm = predict(model.svm, subset[,c(1:5,7)])

table(subset$cover_type, predicciones_svm, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_svm_0 <- mean(subset$cover_type != predicciones_svm) * 100 
training_error_svm_0
```
El kernel que mejor da es el radial. 

```{r}
predicciones_svm_te = predict(model.svm, testeo)

table(testeo$clase, predicciones_svm_te, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_svm <- mean(testeo$clase != predicciones_svm_te) * 100 
training_error_svm
```

Si lo hacemos con menos variables:

```{r}
subsubset1 <-subset[(subset$cover_type==1),]
subsubset2 <-subset[(subset$cover_type==2),]
subsubset <-rbind(subsubset1,subsubset2)
subsubset$cover_type <- droplevels(subsubset$cover_type)
```


```{r}
library(e1071)

model.svm_2 = svm( subsubset$cover_type ~ ., data = subsubset[,c(1:5,7)], kernel = "radial", cost = 10, scale = TRUE)
print(model.svm)

```


```{r}
predicciones_svm_2 = predict(model.svm_2, subsubset[,c(1:5,7)])

table(subsubset$cover_type, predicciones_svm_2, dnn = c("Clase real", "Clase predicha"))
```
```{r}
training_error_svm_2 <- mean(subsubset$cover_type != predicciones_svm_2) * 100 
training_error_svm_2
```
Ahora con el test:

```{r}
predicciones_svm_te_2 = predict(model.svm_2, testeo_2class)

table(testeo_2class$clase, predicciones_svm_te_2, dnn = c("Clase real", "Clase predicha"))
```

```{r}
training_error_svm_2_te <- mean(testeo_2class$clase != predicciones_svm_te_2) * 100 
training_error_svm_2_te
```
Graficando para todas las clases:

```{r}
plot(model.svm, data=subset[,c(1:5,7)],Elevation ~ Aspect,  color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)
```

```{r}
plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm, data=subset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

```
```{r}
plot(model.svm_2, data=subsubset[,c(1:5,7)],Elevation ~ Aspect,  color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Slope, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Elevation ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Slope, color.palette=topo.colors)
```

```{r}
plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Aspect ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Hydrology, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Slope ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)
plot(model.svm_2, data=subsubset[,c(1:5,7)], Horizontal_Distance_To_Hydrology ~ Horizontal_Distance_To_Roadways, color.palette=topo.colors)

```


Se dibujan mas fronteras cuando la variable Elevation esta presente.

## Clasificacion no supervisada

Para la clasificacion no supervisada, comenzamos con metodos jerarquicos porque k-means es sensible a outliers:

```{r}
datos_clust <- subset
mat_dist <- dist(x = subset, method = "euclidean") 

```

Dendrogramas para distintos metodos, suponemos 7 clusters:

```{r}
hc_complete <- hclust(d = mat_dist, method = "complete") 
hc_average  <- hclust(d = mat_dist, method = "average")
hc_single   <- hclust(d = mat_dist, method = "single")
hc_ward     <- hclust(d = mat_dist, method = "ward.D2")
hc_cn     <- hclust(d = mat_dist, method = "centroid")
```

Construyendo los dendrogramas:

```{r}
cantidad_clusters = 7

plot(hc_complete)
rect.hclust(hc_complete, k=cantidad_clusters, border="red") #

jer_complete<-cutree(hc_complete,k=cantidad_clusters)           #
datos_clust$jer_complete=jer_complete

```
```{r}
cantidad_clusters = 7

plot(hc_average)
rect.hclust(hc_average, k=cantidad_clusters, border="red") #

jer_average<-cutree(hc_average,k=cantidad_clusters)           #
datos_clust$jer_average=jer_average
```

```{r}
cantidad_clusters = 7

plot(hc_single)
rect.hclust(hc_single, k=cantidad_clusters, border="red") #

jer_single<-cutree(hc_single,k=cantidad_clusters)           #
#datos$jer_complete=jer_complete
```

Da feisimo...

```{r}
cantidad_clusters = 7

plot(hc_ward)
rect.hclust(hc_ward, k=cantidad_clusters, border="red") #

jer_ward<-cutree(hc_ward,k=cantidad_clusters)           #
datos_clust$jer_ward=jer_ward
```


```{r}
cantidad_clusters = 7

plot(hc_cn)
rect.hclust(hc_cn, k=cantidad_clusters, border="red") #

jer_cn<-cutree(hc_cn,k=cantidad_clusters)           #
datos_clust$jer_cn=jer_cn
```


Los coeficientes de correlacion cofenetica:

```{r}
cor(x = mat_dist, cophenetic(hc_complete))
cor(x = mat_dist, cophenetic(hc_average))
cor(x = mat_dist, cophenetic(hc_single))
cor(x = mat_dist, cophenetic(hc_ward))
cor(x = mat_dist, cophenetic(hc_cn))
```
El linkage single es el peor.

```{r}
datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)
p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=jer_cn)) +
  geom_point() + scale_color_brewer(palette="Dark2")

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn)) +
  geom_point()
p2 + scale_color_brewer(palette="Dark2")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p3 + scale_color_brewer(palette="Dark2")

p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p4 + scale_color_brewer(palette="Dark2")
```


```{r}
library(ggplot2)
library(easyGgplot2)

datos_clust$jer_complete <- as.factor(datos_clust$jer_cn)

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=jer_cn, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')


p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_cn,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')

ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)

png('jerarquical.png')

ggplot2.multiplot(p2,p3,p4,p7,p9,p10,cols=3)

dev.off()
```

```{r}
library(ggplot2)

p <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")


p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

```
```{r}
library(ggplot2)

p <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=jer_cn)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")

p <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=jer_complete)) +
  geom_point()
p + scale_color_brewer(palette="Dark2")


```
Solo lo hice con el primer metodo de cluster jerarquico que fue el que mejor dio el coeficiente cofrenetico.




```{r echo=TRUE}
library(cluster)

datos_kmeans = datos_clust[1:5]

cantidad_clusters=7

CL  = kmeans(scale(datos_kmeans),cantidad_clusters)
datos_kmeans$kmeans = CL$cluster
```

```{r}
library(ggplot2)
library(easyGgplot2)

datos_kmeans$kmeans <- as.factor(datos_kmeans$kmeans)

p1 <- ggplot(datos_clust, aes(x=Elevation, y=Aspect, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p2 <- ggplot(datos_clust, aes(x=Elevation, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p3 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none") +labs(y='H_Hydrology')

p4 <- ggplot(datos_clust, aes(x=Elevation, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p5 <- ggplot(datos_clust, aes(x=Aspect, y=Slope, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none")

p6 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans, alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2") + theme(legend.position = "none") +labs(y='H_Hydrology')

p7 <- ggplot(datos_clust, aes(x=Aspect, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p8 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Hydrology, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p9 <- ggplot(datos_clust, aes(x=Slope, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(y='H_Roadways')

p10 <- ggplot(datos_clust, aes(x=Horizontal_Distance_To_Hydrology, y=Horizontal_Distance_To_Roadways, color=datos_kmeans$kmeans,alpha=0.5)) +
  geom_point() + scale_color_brewer(palette="Dark2")+ theme(legend.position = "none")+labs(x='H_Hydrology',y='H_Roadways')

ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)

png('jerarquical_2.png')

ggplot2.multiplot(p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,cols=3)

dev.off()
```


